import pandas as pd
import numpy as np
import missingno as mno  # needed to visualize missing values; install missingno into conda if the import fails
import altair as alt
import matplotlib.pyplot as plt
import xlrd  # needed to read Excel files; install xlrd into conda if the import fails
import shutil  # needed to copy files
import time
import datetime
import warnings
import joblib
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.impute import KNNImputer
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import RidgeCV
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error

warnings.simplefilter(action='ignore', category=FutureWarning)
alt.data_transformers.disable_max_rows()
2 Introduction and data
The aim of this project is to investigate whether there is a correlation between household income and the death rate in the United States of America. In order to explore this relation, we gathered data on both topics and will analyse how and to what extent the death rate is affected by household income.
2.1 Research Question
Does household income have an impact on death rates in the U.S., and if so, how large is the effect?
Our hypothesis regarding the research question is:
Household income and the death rate will have a negative correlation.
That is, the higher the household income, the lower the death rate will be.
The predictor variable will be the median household income. The main response variable will be the age-adjusted death rate. Further insights can be gained by using categories like death cause, state or year. Other useful information is provided by the total number of deaths. The data dictionary below shows more details about the required variables.
We import the original data, transform the .xls file to a .csv file and store it in the raw folder. We also declare our dataframes.
Code
# Declare variables
external_data = '..\\data\\external\\'
raw_data = '..\\data\\raw\\'
interim_data = '..\\data\\interim\\'
processed_data = '..\\data\\processed\\'

# File names
orig_income_file = 'Median_Household_Income_By_State_1990-2017.xls'
target_income_file = 'Median_Household_Income_By_State_1990-2017.csv'
orig_death_file = 'NCHS_-_Leading_Causes_of_Death__United_States.csv'

# Save external median income data as csv in folder 'raw'
# Read file
xls_household_file = pd.read_excel(external_data + orig_income_file)
# Save file
xls_household_file.to_csv(raw_data + target_income_file, index=None, header=True)

# Copy external leading cause of death data into folder 'raw'
shutil.copy(external_data + orig_death_file, raw_data + orig_death_file)

# Declare both dataframes
df_income = pd.read_csv(raw_data + target_income_file)
df_death = pd.read_csv(raw_data + orig_death_file)
2.2.2 Data structure
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10868 entries, 0 to 10867
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Year 10868 non-null int64
1 113 Cause Name 10868 non-null object
2 Cause Name 10868 non-null object
3 State 10868 non-null object
4 Deaths 10868 non-null int64
5 Age-adjusted Death Rate 10868 non-null float64
dtypes: float64(1), int64(2), object(3)
memory usage: 509.6+ KB
In the death dataset we have 10868 cases and 6 columns. The income dataset looks like this:
Table 102.30. Median household income, by state: Selected years, 1990 through 2017
(The raw import shows 0 rows × 32 columns; apart from the title, every column is read in as 'Unnamed: 1' through 'Unnamed: 31', since the real header information sits inside the data rows.)
2.2.3 Data corrections
Income Dataset
From the overview, we have seen that the df_income dataset needs to be cleaned. The code section below shows which steps are needed.
Code
# We only need the first 16 columns
# and we can also drop columns 4, 6, 8, 10, 12 and 14 since they only contain NaN values
column_lst = [0, 1, 2, 3, 5, 7, 9, 11, 13, 15]
df_income_corrected = df_income[df_income.columns[column_lst]]

# We don't need rows 0, 2 and 64-67
row_drop_lst = [0, 2, 64, 65, 66, 67]
df_income_corrected = df_income_corrected.drop(row_drop_lst)

# There are a few rows with empty entries left, which we can get rid of,
# since those rows have no state assigned to them; they are only used as separators
df_income_corrected.dropna(inplace=True)

# The column names are actually in the first row; additionally they need to be adjusted
column_names = ['State', '1990', '2000', '2005', '2010', '2013', '2014', '2015', '2016', '2017']
df_income_corrected.columns = column_names

# Row number 1 can be dropped
df_income_corrected.drop(1, inplace=True)

# The dots in the state column can be removed
# We also do not want any leading or trailing spaces in the strings
df_income_corrected = df_income_corrected.replace(r'\.', '', regex=True)
df_income_corrected['State'] = df_income_corrected['State'].str.strip()

# Lastly we reset the row index and drop the old index
df_income_corrected.reset_index(drop=True, inplace=True)

# We transform the table to show the median income for a state in a single year
# We use the melt() function for this
lst_years = column_names[1:]
df_income_corrected = df_income_corrected.melt(
    id_vars=['State'],
    value_vars=lst_years,
    var_name='year',
    value_name='median_household_income'
)

# The State column should be lowercase
df_income_corrected = df_income_corrected.rename(columns={'State': 'state'})

# The types need to be declared: state holds categorical values, year holds integers and income holds floats
df_income_corrected['state'] = df_income_corrected['state'].astype('category')
df_income_corrected['year'] = df_income_corrected['year'].astype('int')
df_income_corrected['median_household_income'] = df_income_corrected['median_household_income'].astype('float')
   state          year  median_household_income
0  United States  1990  57500.0
1  Alabama        1990  45200.0
Death Dataset
From the overview of the death dataset, we also need to correct the data. The code section below shows which steps are needed.
Code
# Make a copy to perform corrections on
df_death_corrected = df_death.copy()

# Change column names to lowercase
df_death_corrected.columns = df_death_corrected.columns.str.lower()

# Change spaces and the '-' to underscores
df_death_corrected.columns = df_death_corrected.columns.str.replace(r' ', '_', regex=True)
df_death_corrected.columns = df_death_corrected.columns.str.replace(r'-', '_', regex=True)

# Remove any leading or trailing whitespaces for the object columns
df_death_corrected['113_cause_name'] = df_death_corrected['113_cause_name'].str.strip()
df_death_corrected['cause_name'] = df_death_corrected['cause_name'].str.strip()
df_death_corrected['state'] = df_death_corrected['state'].str.strip()

# Change Dtype of 113 Cause Name, Cause Name and State column to category
cols = df_death_corrected.select_dtypes(include='object').columns.to_list()
df_death_corrected[cols] = df_death_corrected[cols].astype('category')

file = 'Corrected_NCHS_-_Leading_Causes_of_Death__United_States.csv'
df_death_corrected.to_csv(interim_data + file, index=None, header=True)
Joined Dataset
Now, we are able to add the median household income from the income dataset to the death dataset by adding the corresponding value to the correct year and state present in the death dataset.
Code
# Merge both dataframes
df_joined = pd.merge(
    df_death_corrected,
    df_income_corrected,
    how='left',
    on=['year', 'state']
)

# Save the joined dataset
file = 'Joined_Dataset.csv'
df_joined.to_csv(processed_data + file, index=None, header=True)
We take a look at the joined dataset to see if we need to do anything before using it:
We will impute the missing values by using a KNN Imputation since that will typically result in a good imputation for numerical values. The data needs to be scaled in order for the algorithm to perform well.
After the imputation, we’ll have to use the inverse_transform() function from MinMaxScaler to bring the scaled dataset back in the original form. In the appendix, you can find the imputation analysis and statistics in detail.
Code
# Imputation
# Only keep numerical columns
col_num = df_joined.select_dtypes(include=[np.number]).columns.to_list()

# Original household median income
original_df_joined = df_joined[col_num]

# Scaled household median income
scaler = MinMaxScaler()
scaled_df_joined = scaler.fit_transform(original_df_joined)
scaled_df_joined = pd.DataFrame(data=scaled_df_joined, columns=original_df_joined.columns)

# Impute
imputer_scaled = KNNImputer(n_neighbors=1)
imputed_scaled = imputer_scaled.fit_transform(scaled_df_joined)

# Convert to DataFrames
imputed_scaled = pd.DataFrame(data=imputed_scaled, columns=original_df_joined.columns)

# Inverse the scaling
imputed_scaled = scaler.inverse_transform(imputed_scaled)
imputed_scaled = pd.DataFrame(data=imputed_scaled, columns=original_df_joined.columns)
2.2.4 Variable lists
The detailed list of variables can be found in the data dictionary.
Code
# define outcome variable as y_label
y_label = 'age_adjusted_death_rate'

# select features
features = df_joined.drop(columns=[y_label]).columns.tolist()

# create feature data for data splitting
X = df_joined[features]

# create response for data splitting
y = df_joined[y_label]
2.2.5 Data splitting
# Data Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# data training set
df_train = pd.DataFrame(X_train.copy())
df_train = df_train.join(pd.DataFrame(y_train))
3 Analysis
We will focus on the correlation between the median income and the age-adjusted death rate over all death causes summarized. Although we suspect the death cause, year and the state to be indicators for variation on a more detailed level, this analysis would be beyond the scope of this project.
In order to test our hypothesis we will inspect summary statistics and use different visualizations to understand the relations between the predictor and response variables and gain further knowledge. First we copy the training dataframe. This way we prevent the training data from being changed during data exploration:
df_explore = df_train.copy()
3.1 Descriptive statistics
Let’s take a look at the summary statistics for the exploration dataframe:
                         count   mean      std        min      25%      50%      75%      max
year                     7607.0  2007.97   5.47       1999.0   2003.0   2008.0   2013.0   2017.0
deaths                   7607.0  14941.20  108968.33  21.0     616.0    1734.0   5802.5   2813503.0
median_household_income  7607.0  57785.01  9368.38    40000.0  50400.0  55800.0  63400.0  82400.0
age_adjusted_death_rate  7607.0  126.77    221.84     2.6      19.2     36.0     153.2    1061.2
We also look at the IQR for the features:
year 10.0
deaths 5186.5
median_household_income 13000.0
age_adjusted_death_rate 134.0
dtype: float64
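The IQR is simply the difference between the 75th and the 25th percentile of a column. A minimal sketch of how such values are computed, on a toy frame with made-up rows (chosen so the quartiles match the values reported above):

```python
import pandas as pd

# Toy stand-in for the exploration dataframe; the rows are illustrative, not the real data
df = pd.DataFrame({
    "year": [1999.0, 2003.0, 2008.0, 2013.0, 2017.0],
    "median_household_income": [40000.0, 50400.0, 55800.0, 63400.0, 82400.0],
})

# IQR = Q3 - Q1, computed column-wise via quantile()
iqr = df.quantile(0.75) - df.quantile(0.25)
print(iqr)
```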
We will further explore the data in the next segment and provide a detailed interpretation for the age adjusted death rate as well as the median household income.
3.2 Exploratory data analysis
The distribution for the age adjusted death rate looks like this:
The statistic and distribution for the age adjusted death rate show:
75 % of the values are at or below 153.2, while the maximum goes up to 1061.2. This is a heavily right-skewed distribution with a lot of outliers, since the IQR is 134.
The distribution visualizes this effect but does not yet explain it. It also shows that the distribution is bimodal.
The standard deviation shows that the values differ a lot from the mean, which can also be explained by the skew.
Our first interpretation is that the cause_name column contains summarized values over every cause of death (described with the category 'All causes'), which could lead to the effect seen in the table and visualization above.
We need to take a more detailed look to correctly interpret this distribution:
Here we can clearly see the reason for the distribution: the cause_name feature stores individual causes as well as a summary over all causes within the same column. We will make another dataframe filtered by 'All causes' to serve as a summary dataset:
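The filter step itself is not shown above; it presumably looks like this (a sketch on made-up rows; the variable name df_explore_causes_summarized matches its later use in the correlation code):

```python
import pandas as pd

# Toy stand-in for df_explore with a mixed cause_name column (illustrative rows)
df_explore = pd.DataFrame({
    "cause_name": ["All causes", "Cancer", "All causes", "Heart disease"],
    "age_adjusted_death_rate": [800.0, 150.0, 750.0, 180.0],
})

# Keep only the rows that summarize every cause of death
df_explore_causes_summarized = df_explore[df_explore.cause_name == "All causes"].copy()
df_explore_causes_summarized.reset_index(drop=True, inplace=True)
print(df_explore_causes_summarized)
```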
To visualize the relation between age-adjusted death rate and median household income, we analyze scatterplots:
There seems to be a moderate to strong negative correlation between the two variables. We suspect the correlation to not be linear which we will investigate later.
We take a look at the correlation for the data filtered by 'All causes':
Code
# inspect correlation for summarized cause name again
corr = df_explore_causes_summarized.corr()
corr.style.background_gradient(cmap='Blues')
                         year       deaths     median_household_income  age_adjusted_death_rate
year                     1.000000   0.033989   -0.078670                -0.446958
deaths                   0.033989   1.000000   0.005858                 -0.045224
median_household_income  -0.078670  0.005858   1.000000                 -0.512797
age_adjusted_death_rate  -0.446958  -0.045224  -0.512797                1.000000
The correlation matrix shows that there is a moderate to strong negative correlation between age_adjusted_death_rate and median_household_income. That means our hypothesis is supported by the data. The data additionally shows that the more years pass, the fewer people die (looking at data from 1999-2017), which could be explained by improvements in the medical sector as well as in the standard of living. To verify this assumption we would need further analysis and possibly more data, but this is not within the scope of our project. No other significant correlation is visible.
Since the predictor and the response variable are numeric and we try to find a pattern between them, we have a regression problem. We will start with simple linear regression even though we do not assume a linear relationship. In addition we use lasso regression and polynomial regression and compare the models to see which performs better.
We will only use the variable ‘all causes’ and filter the death rates by it, because we focus on the total deathrate in relation to the median household income.
Before we start, we encode the categorical features as numeric features for the models and perform the data splitting again since we made changes. We will also drop the year column, since we are not evaluating a time series and only use the extra data provided, and the deaths column, since it is already factored into the death rate.
Code
# drop 113_cause_name and year columns
df_joined.drop(['113_cause_name', 'year'], axis=1, inplace=True)

# Filter only for cause_name = 'All causes'
df_joined_summarized = df_joined[df_joined.cause_name == 'All causes'].copy()

# Drop the cause_name column since it only contains the same entry
# and the deaths column since it is already factored into the death rate
df_joined_summarized.drop(['cause_name', 'deaths'], axis=1, inplace=True)

# Reset index to fit new row number
df_joined_summarized.reset_index(drop=True, inplace=True)

# identify relevant numerical variables
list_num = df_joined_summarized.select_dtypes(include=[np.number]).columns.tolist()
list_num.remove(y_label)

# identify all categorical variables
list_cat = df_joined_summarized.select_dtypes(['category']).columns

# convert all categorical variables to numeric
df_joined_summarized[list_cat] = df_joined_summarized[list_cat].apply(lambda x: pd.factorize(x)[0])

# Perform data splitting
y_label = 'age_adjusted_death_rate'
features = df_joined_summarized.drop(columns=[y_label]).columns.tolist()
X = df_joined_summarized[features]
y = df_joined_summarized[y_label]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# data training set
df_train = pd.DataFrame(X_train.copy())
df_train = df_train.join(pd.DataFrame(y_train))
4.1.1 Linear Regression
Select model
# select the linear regression model
reg_lin = LinearRegression()
Training and validation
# cross-validation with 5 folds, only on median_household_income since we use simple linear regression
scores_lin = cross_val_score(reg_lin, X_train[['median_household_income']], y_train, cv=5,
                             scoring='neg_mean_squared_error') * -1
Fit model
# Fit the model to the median_household_income feature in the training data
reg_lin.fit(X_train[['median_household_income']], y_train)
LinearRegression()
First we need to standardize our numerical features in order to use lasso regression.
Code
# Make copies of the X matrices
X_train_las = X_train.copy()
X_test_las = X_test.copy()

# standardize the dataframes in order for lasso regression to work
scaler = StandardScaler().fit(X_train_las[list_num])
X_train_las[list_num] = scaler.transform(X_train_las[list_num])
X_test_las[list_num] = scaler.transform(X_test_las[list_num])
Next we let the algorithm choose which alpha (the regularization strength, which indirectly controls how many features remain in the model) provides the best results.
Code
# Lasso with 5-fold cross-validation
reg_las = LassoCV(cv=5, random_state=0, max_iter=10000)

# Fit model
reg_las.fit(X_train_las, y_train)

# Set best alpha
reg_las = Lasso(alpha=reg_las.alpha_)

# show best alpha
reg_las
Lasso(alpha=0.7649445649117792)
The low value for alpha means that the lasso regression will keep more features when training the model.
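The effect of alpha on the number of surviving features can be illustrated on synthetic data (a sketch, not our project data): the larger alpha is, the more coefficients are shrunk to exactly zero.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data: only the first two of ten features actually matter
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=200)

# Count the non-zero coefficients for a small and a large alpha
n_nonzero = {}
for alpha in (0.01, 1.0):
    model = Lasso(alpha=alpha).fit(X, y)
    n_nonzero[alpha] = int(np.sum(model.coef_ != 0))
print(n_nonzero)
```

With the small alpha most of the noise features keep a (tiny) coefficient; with the large alpha essentially only the two true features survive.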
We then perform the same steps as we did with the linear regression model.
Code
# cross-validation with 5 folds
scores_las = cross_val_score(reg_las, X_train_las, y_train, cv=5, scoring='neg_mean_squared_error') * -1

# Fit the model to the complete training data
reg_las.fit(X_train_las, y_train)

# obtain predictions
y_pred_las = reg_las.predict(X_test_las)

# R squared
r2_las = r2_score(y_test, y_pred_las).round(3)

# MSE
MSE_las = mean_squared_error(y_test, y_pred_las).round(3)

# RMSE
RMSE_las = mean_squared_error(y_test, y_pred_las, squared=False).round(3)

# MAE
MAE_las = mean_absolute_error(y_test, y_pred_las).round(3)
We can see that the polynomial regression model performs best given our data. The Results section gives a more in-depth analysis of the model results.
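The polynomial model itself is not shown above; it was presumably built along these lines (a sketch under the assumption of a degree-2 fit on median_household_income via PolynomialFeatures, mirroring the linear-regression steps; the data here is synthetic):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in for the income / death-rate relation (illustrative numbers only)
rng = np.random.default_rng(42)
income = rng.uniform(40_000, 82_000, size=300).reshape(-1, 1)
rate = 1000 - 0.02 * income[:, 0] + 1e-7 * income[:, 0] ** 2 + rng.normal(0, 20, 300)

# Degree-2 polynomial regression: expand the feature, then fit ordinary least squares
reg_poly = make_pipeline(PolynomialFeatures(degree=2, include_bias=False), LinearRegression())
reg_poly.fit(income, rate)
y_pred_poly = reg_poly.predict(income)
print(round(r2_score(rate, y_pred_poly), 3))
```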
   Model                  R2     MSE       RMSE    MAE
0  Polynomial Regression  0.317  6191.236  78.684  63.386
Although these metrics are overall the best of the three models we ran, there is still a lot of room for improvement.
The error values (MSE, RMSE, MAE) show that the data spreads a lot around the fitted line.
R² is a measure of how well the model fits the data; more specifically, how much of the variation of the outcome variable (in our case the age-adjusted death rate) can be explained by the predictor (the median household income).
An R² score of 0.317 means that 31.7 % of the variation for the age adjusted death rate can be explained by solely observing the median household income.
For a variable as complex as death rates we consider this to be at least a decent model when only taking the median household income as the sole predictor into account.
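The R² definition can be checked by hand against its formula, 1 minus the residual sum of squares over the total sum of squares (a sketch with made-up predictions):

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up true values and predictions, purely for illustration
y_true = np.array([100.0, 150.0, 200.0, 250.0])
y_pred = np.array([110.0, 140.0, 210.0, 240.0])

# R² = 1 - SS_res / SS_tot
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, r2_score(y_true, y_pred))
```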
The plot underneath shows the final fit made by the polynomial regression. Ignoring the jagged line (a smoothing failure on our part), it depicts a quadratic fit to the data. It could be that a higher-degree polynomial is a better fit for the relation between income and death rate, or that the death rate is too complex to be described by only one variable (which we strongly assume).
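The jaggedness likely comes from drawing mark_line over points in test-set order; sorting the prediction frame by the x variable before plotting would probably smooth it (a sketch with hypothetical values, column names matching our dataframes):

```python
import pandas as pd

# Predictions joined back onto X_test come out in test-set order, not sorted by income
df_line_poly = pd.DataFrame({
    "median_household_income": [60000, 45000, 82000, 52000],
    "age_adjusted_death_rate": [700.0, 850.0, 620.0, 790.0],
})

# Sorting by the x variable lets a line mark connect the points from left to right
df_line_poly = df_line_poly.sort_values("median_household_income").reset_index(drop=True)
print(df_line_poly)
```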
Plot
Code
# Make new dataframes for the altair charts
df_scatter = X_test.join(y_test)

# Make a series, set the index to the y_test series, make dataframe to join with X_test
series_y_pred_poly = pd.Series(y_pred_poly)
series_y_pred_poly.index = y_test.index
df_y_pred_poly = pd.DataFrame(series_y_pred_poly)
df_y_pred_poly.rename(columns={0: 'age_adjusted_death_rate'}, inplace=True)
df_line_poly = X_test.join(df_y_pred_poly)

# Make plots
scatter = alt.Chart(df_scatter).mark_circle(size=60).encode(
    alt.X('median_household_income', title='Median household income ($)', scale=alt.Scale(zero=False)),
    alt.Y('age_adjusted_death_rate', title='Age-adjusted death rate', scale=alt.Scale(zero=False)),
    tooltip=['median_household_income', 'age_adjusted_death_rate']
).interactive()

line_poly = alt.Chart(df_line_poly).mark_line(color='red').encode(
    alt.X('median_household_income', title='Median household income ($)', scale=alt.Scale(zero=False)),
    alt.Y('age_adjusted_death_rate', title='Age-adjusted death rate', scale=alt.Scale(zero=False)),
    tooltip=['median_household_income', 'age_adjusted_death_rate']
).interactive()

# Put plots together
plot_poly = scatter + line_poly
plot_poly
6 Discussion + Conclusion
The correlation matrix in the relationships section as well as the scatterplot in the data exploration showed that our initial hypothesis is supported by the data. We have seen that the negative correlation between the age-adjusted death rate and median household income can be quantified with a value of r = -0.51. The polynomial regression model was the best one we tried, although it still had a low R². Improvement can be made by trying out more models or adding more features (feature engineering / model ensembling).
With the models we used, our research question can be answered with a yes. The impact is quantified by the metric R², and we consider a value of 0.317 to be substantial.
6.1 Outlook
We have also seen a significant correlation between the age-adjusted death rate and the years. This could be further analyzed and additional time series models could be built.
In the appendix we made a scatterplot for every death cause related to the median household income and saw that in certain states, certain death causes are more prominent than others. So the correlation between death rate and median household income could be interesting regarding the specific death cause.
7 Appendix
7.1 Additional Information regarding the research question
Our research question is backed by the following studies:
KINGE, Jonas Minet, et al. Association of household income with life expectancy and cause-specific mortality in Norway, 2005-2015. Jama, 2019, 321. Jg., Nr. 19, S. 1916-1925. (https://jamanetwork.com/journals/jama/article-abstract/2733322)
KAPLAN, George A., et al. Inequality in income and mortality in the United States: analysis of mortality and potential pathways. Bmj, 1996, 312. Jg., Nr. 7037, S. 999-1003. (https://www.bmj.com/content/312/7037/999.full)
O’CONNOR, Gerald T., et al. Median household income and mortality rate in cystic fibrosis. Pediatrics, 2003, 111. Jg., Nr. 4, S. e333-e339. (https://publications.aap.org/pediatrics/article-abstract/111/4/e333/63113/Median-Household-Income-and-Mortality-Rate-in)
Although the first study was done in Norway and the second study investigates mortality instead of the death rate, we expect to make similar observations.
Added information on mortality rate:
Mortality refers to susceptibility to death. While the crude death rate refers to the number of deaths in a population in a year, the mortality rate is the number of deaths per thousand people over a period of time, normally a year. (see: https://www.differencebetween.com/difference-between-death-rate-and-vs-mortality-rate/)
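A quick worked example of the "per thousand" definition, with made-up numbers:

```python
# Mortality rate = deaths per 1,000 people per year (illustrative figures)
population = 200_000
deaths = 1_700

mortality_rate_per_1000 = deaths / population * 1_000
print(mortality_rate_per_1000)
```

So a town of 200,000 people with 1,700 deaths in a year has a mortality rate of 8.5 per thousand.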
After joining, the median_household_income is now an additional column for the death data set.
We only have the median household income for the years 1990, 2000, 2005, 2010 and 2013-2017. In the death dataset, we find the years 1999 through 2017. That is the reason why we only have 4576 non-null values for the median household income. This means that roughly 58 % of the column is empty, which we need to either fill (by imputing) or remove.
Since removing would cut our dataset by more than half, that is not a viable option.
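The 58 % figure follows directly from the counts (a sketch; in the notebook this would be df_joined['median_household_income'].isna().mean() * 100):

```python
# Derive the missing-value share from the counts reported above
total_rows = 10868       # rows in the joined dataset
non_null_income = 4576   # non-null median_household_income values

pct_missing = (1 - non_null_income / total_rows) * 100
print(round(pct_missing, 1))
```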
First, let's take a look at the summary statistics for the median household income and its distribution.
7.2.1 Summary Statistics
Code
# Dataframe with only median household income as column
df_household_income = df_imputation[['median_household_income']]

# Summary statistics
df_household_income.describe().round(2).T
# Distribution of median household income
df_household_income.plot(kind='kde', figsize=(8, 5),
                         title='Distribution of median household income before imputation')
plt.xlabel('median household income')
# Scatterplot income over years before imputation
scatter_income_before = alt.Chart(df_imputation).mark_circle(size=60).encode(
    x=alt.X('year', title='Year', scale=alt.Scale(domain=(1999, 2018))),
    y=alt.Y('median_household_income', title='Median Household Income ($)', scale=alt.Scale(zero=False)),
    tooltip=['year', 'median_household_income'],
)
scatter_income_before.title = 'Median Household Income over the years'
scatter_income_before.interactive()
Code
# Boxplot over all income values before imputation
box_income_before = alt.Chart(df_household_income).mark_boxplot(size=50).encode(
    x=alt.X('median_household_income', title='median household income', scale=alt.Scale(zero=False)),
    y=alt.Y()
).properties(
    width=450,
    height=80
)
box_income_before
The distribution is unimodal and right-skewed, extending from about $40k to over $80k.
The median is at $56,350 and, with an IQR of $12,950, there are no outliers present in the income data.
Also the values for the years 1999, 2001-2004, 2006-2009, 2011 & 2012 are missing.
7.2.4 Imputed dataset
Let us take a look at what the imputation has done for the missing values and how we can use the result:
Summary statistics
Code
# Compare the original and imputed median income column
df_income = df_imputation[['median_household_income']].copy()
df_income['income_KNN_Scaled'] = imputed_scaled['median_household_income']

# Obtain summary statistics
df_income.describe().T[['mean', 'std', 'min', '50%', 'max']]
# Distribution of median household income
df_income.plot(kind='kde', figsize=(8, 5),
               title='Distribution of median household income after imputation')
plt.xlabel('median household income')
7.2.6 Plot
Code
# Scatterplot income over years after imputation
scatter_income_after = alt.Chart(imputed_scaled).mark_circle(size=60).encode(
    x=alt.X('year', title='Year', scale=alt.Scale(domain=(1999, 2018))),
    y=alt.Y('median_household_income', title='Median Household Income ($)', scale=alt.Scale(zero=False)),
    tooltip=['year', 'median_household_income']
).interactive()

# Vertical concatenation of scatterplots before and after imputation
alt.vconcat(scatter_income_before, scatter_income_after)
Code
# Boxplot over all income values after imputation
box_income_after = alt.Chart(imputed_scaled).mark_boxplot(size=50).encode(
    x=alt.X('median_household_income', title='median household income', scale=alt.Scale(zero=False)),
    y=alt.Y()
).properties(
    width=450,
    height=80
)

# Vertical concatenation of boxplots before and after imputation
chart = alt.vconcat(box_income_before, box_income_after)
chart = chart.configure_view(
    strokeWidth=0
).configure_title(
    fontSize=20,
    font='Courier',
    anchor='start',
    offset=20
)
chart.title = 'Income boxplot before and after imputation'
chart
From the summary statistics we can see that the imputed values represent the original values quite well; they are close to the existing ones. The IQR is slightly greater, but there are still no outliers after the imputation. This can be explained by the method we used, since the KNN algorithm imputes values by calculating distances to existing neighbouring data points for every instance we want to replace.
We found that by setting the number of neighbours (n_neighbors) to 1, the imputed data represents the original data best without making assumptions. Since we are no domain experts, we want to change the data as little as possible.
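With n_neighbors=1 the imputer simply copies the value of the single nearest row, which is why the imputed incomes look like duplicates of neighbouring years. A minimal sketch on made-up data (the min-max scaling used in the actual notebook is omitted for brevity):

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# The income for 2001 is missing; 2000 is the nearest year (illustrative values)
df = pd.DataFrame({
    "year": [2000.0, 2001.0, 2005.0],
    "median_household_income": [50000.0, np.nan, 56000.0],
})

# With n_neighbors=1 the imputer copies the value from the single nearest row
imputer = KNNImputer(n_neighbors=1)
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(imputed)
```

The missing 2001 income is filled with the 2000 value, mirroring the copy effect described above.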
The scatterplot shows that the missing values for median household income have been copied from years with existing values. For example, the years 1999-2002 all share the values from 2000, 2003-2007 the values from 2005, and 2008-2011 the values from 2010. 2012 has been copied from 2013. This has the advantage that it maintains the spread in the data. The disadvantage is that we make the implicit assumption that the median household income by state stayed the same for the years close to 2000, 2005 and 2010.
This is visible by comparing the boxplots before and after imputation which are very similar.
In reality the median income almost certainly did not stay the same, but for our case this gives a decent estimation for training our model.
This will leave us with 6 columns in total, which we can use for our data analysis (we will later disregard the 113_cause_name column since, as we will see, the same information is provided by the cause_name column):
7.3 Additional data exploration
The scatterplots for the relation between median household income and age-adjusted death rate for every single category look like this:
These charts show all death causes in detail, with one chart per death cause. The same negative correlation as in the plot for all death causes is visible.
It seems that the death cause only slightly influences this correlation. Cancer appears to be the death cause with the strongest negative correlation. This could be used for further analysis and model training considering a specific death cause.
Code
chart = alt.Chart(df_imputation).mark_circle().encode(
    alt.X('year', scale=alt.Scale(zero=False)),
    alt.Y('age_adjusted_death_rate', title='age-adjusted death rate', scale=alt.Scale(zero=False, padding=1)),
    color='state',
)
chart.title = 'Age-adjusted death rate in different years and states'
chart
This chart could be interesting, if we could extend it and visualize the death rate per state and year.
Code
chart = alt.Chart(df_imputation).mark_circle().encode(
    alt.X('year', scale=alt.Scale(zero=False)),
    alt.Y('median_household_income', title='median household income', scale=alt.Scale(zero=False, padding=1)),
    color='state',
)
chart.title = 'Development of median household income over the years by states'
chart
Code
chart = alt.Chart(df_imputation).mark_bar(opacity=0.7).encode(
    x=alt.X('cause_name:O', title='cause name'),
    y=alt.Y('age_adjusted_death_rate:Q', title='age-adjusted death rate', stack=None),
    color='state',
    tooltip=['state']
).transform_filter(
    {'not': alt.FieldEqualPredicate(field='cause_name', equal='All causes')}
)
chart.title = 'Causes for age-adjusted death rates in different states'
chart
This graph shows that heart disease in Arizona and cancer in South Dakota are by far the top death causes in terms of death rate. Recall that cancer has a really strong negative correlation with the median income.
7.4 Additional relation analysis
Code
# identify all categorical variables
list_cat = df_explore.select_dtypes(['category']).columns

# convert all categorical variables to numeric
df_explore[list_cat] = df_explore[list_cat].apply(lambda x: pd.factorize(x)[0])
Code
# take a look at all correlations
corr = df_explore.corr()
corr.style.background_gradient(cmap='Blues')
                         year       113_cause_name  cause_name  state      deaths     median_household_income  age_adjusted_death_rate
year                     1.000000   0.005480        0.005480    0.002946   0.013353   -0.077663                -0.027888
113_cause_name           0.005480   1.000000        1.000000    0.007820   0.097992   -0.015742                0.402470
cause_name               0.005480   1.000000        1.000000    0.007820   0.097992   -0.015742                0.402470
state                    0.002946   0.007820        0.007820    1.000000   0.004487   -0.023781                0.005450
deaths                   0.013353   0.097992        0.097992    0.004487   1.000000   0.005334                 0.224213
median_household_income  -0.077663  -0.015742       -0.015742   -0.023781  0.005334   1.000000                 -0.031676
age_adjusted_death_rate  -0.027888  0.402470        0.402470    0.005450   0.224213   -0.031676                1.000000
We can see that the features 113_cause_name and cause_name provide the same information. As we predicted, we only need to keep one of the two columns for the model. Therefore we will drop the column '113_cause_name'.
We can ignore the obvious correlations between the age_adjusted_death_rate and cause_name, or deaths as well as deaths and cause_name, since those do not provide extra information.
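Dropping the redundant column is a one-liner; a minimal sketch on a toy frame (column names taken from the project, values made up):

```python
import pandas as pd

df = pd.DataFrame({
    "113_cause_name": [0, 1, 2],   # identical to cause_name, hence redundant
    "cause_name": [0, 1, 2],
    "deaths": [100, 200, 300],
})
# drop the duplicated cause column, keeping 'cause_name'
df = df.drop(columns=["113_cause_name"])
print(list(df.columns))  # ['cause_name', 'deaths']
```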
7.5 Additional model result analysis
7.5.1 Linear Regression
Cross Validation
Code
# store cross-validation scores
df_scores_lin = pd.DataFrame({"lr": scores_lin})

# reset index to match the number of folds
df_scores_lin.index += 1

# visualize MSE for each fold
chart = alt.Chart(df_scores_lin.reset_index()).mark_line(
    point=alt.OverlayMarkDef()
).encode(
    x=alt.X("index", bin=False, title="Fold", axis=alt.Axis(tickCount=5)),
    y=alt.Y("lr", aggregate="mean", title="Mean squared error (MSE)")
)
chart.title = "MSE by Folds (linear regression)"
chart
Code
df_scores_lin.describe().T
    count         mean          std          min          25%          50%          75%          max
lr    5.0  7462.394258  1197.570953  6063.244422  6370.508411  7715.406685  8485.279789  8677.531982
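Since the MSE is in squared units, taking the square root of the fold mean gives an error on the scale of the death rate itself (a quick check on the mean value above):

```python
import math

mean_mse = 7462.394258           # mean cross-validation MSE from the table above
rmse = math.sqrt(mean_mse)       # root mean squared error
print(round(rmse, 1))            # 86.4
```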
Coefficients
Code
# intercept
intercept = pd.DataFrame({"Name": ["Intercept"], "Coefficient": [reg_lin.intercept_]})

# make a slope table
slope = pd.DataFrame({"Name": ["median_household_income"], "Coefficient": reg_lin.coef_})

# combine estimates of intercept and slopes
table = pd.concat([intercept, slope], ignore_index=True, sort=False)
round(table, 3)
                      Name  Coefficient
0                Intercept     1087.116
1  median_household_income       -0.005
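Read in the units of the original variables, the fitted line implies roughly the following (a back-of-the-envelope check using the rounded estimates above):

```python
# predicted age-adjusted death rate = intercept + slope * income
intercept = 1087.116
slope = -0.005

# a $10,000 increase in median household income lowers the
# predicted death rate by about 50
delta = slope * 10_000
print(round(delta, 1))  # -50.0
```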
Plot
Code
# Make new dataframes for the altair charts
# Make a series, set the index to the y_test series, make dataframe to join with X_test
series_y_pred_lin = pd.Series(y_pred_lin)
series_y_pred_lin.index = y_test.index
df_y_pred_lin = pd.DataFrame(series_y_pred_lin)
df_y_pred_lin.rename(columns={0: 'age_adjusted_death_rate'}, inplace=True)
df_line_lin = X_test.join(df_y_pred_lin)

# Make plots
line_lin = alt.Chart(df_line_lin).mark_line(color='red').encode(
    alt.X('median_household_income', title='Median household income ($)', scale=alt.Scale(zero=False)),
    alt.Y('age_adjusted_death_rate', title='Age-adjusted death rate', scale=alt.Scale(zero=False)),
    tooltip=['median_household_income', 'age_adjusted_death_rate']
).interactive()

# Put plots together
plot_lin = scatter + line_lin
plot_lin.title = "Scatter chart with linear regression"
plot_lin
7.5.2 Lasso Regression
Cross Validation
Code
# store cross-validation scores
df_scores_las = pd.DataFrame({"lr": scores_las})

# reset index to match the number of folds
df_scores_las.index += 1

# visualize MSE for each fold
chart = alt.Chart(df_scores_las.reset_index()).mark_line(
    point=alt.OverlayMarkDef()
).encode(
    x=alt.X("index", bin=False, title="Fold", axis=alt.Axis(tickCount=5)),
    y=alt.Y("lr", aggregate="mean", title="Mean squared error (MSE)")
)
chart.title = "MSE by Folds (lasso regression)"
chart
Code
df_scores_las.describe().T
    count         mean          std          min          25%          50%          75%          max
lr    5.0  7507.486421  1198.093726  6070.625075  6469.607034  7727.331895  8566.590051  8703.278049
Coefficients
Code
# intercept
intercept = pd.DataFrame({"Name": ["Intercept"], "Coefficient": [reg_las.intercept_]})

# make a slope table
slope = pd.DataFrame({"Name": features, "Coefficient": reg_las.coef_})

# combine estimates of intercept and slopes
table = pd.concat([intercept, slope], ignore_index=True, sort=False)
round(table, 3)
                      Name  Coefficient
0                Intercept      802.619
1                    state       -0.007
2  median_household_income      -46.177
This shows that the lasso regression in our case effectively relies on the same feature as the linear model: the coefficient for state is shrunk almost to zero, so the prediction is driven by the median household income alone.
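This feature-selection behaviour can be illustrated on synthetic data: with a sufficiently large L1 penalty, Lasso shrinks the coefficient of an uninformative feature to exactly zero (toy data, not the project data):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
# only the first feature drives the target; the second is pure noise
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

reg = Lasso(alpha=0.5).fit(X, y)
print(reg.coef_)  # coefficient of the noise feature is exactly 0.0
```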
Plot
Code
# Make new dataframes for the altair charts
# Make a series, set the index to the y_test series, make dataframe to join with X_test
series_y_pred_las = pd.Series(y_pred_las)
series_y_pred_las.index = y_test.index
df_y_pred_las = pd.DataFrame(series_y_pred_las)
df_y_pred_las.rename(columns={0: 'age_adjusted_death_rate'}, inplace=True)
df_line_las = X_test.join(df_y_pred_las)

# Make plots
line_las = alt.Chart(df_line_las).mark_line(color='red').encode(
    alt.X('median_household_income', title='Median household income ($)', scale=alt.Scale(zero=False)),
    alt.Y('age_adjusted_death_rate', title='Age-adjusted death rate', scale=alt.Scale(zero=False)),
    tooltip=['median_household_income', 'age_adjusted_death_rate']
).interactive()

# Put plots together
plot_las = scatter + line_las
plot_las.title = "Scatter chart with lasso regression"
plot_las
7.6 Further ideas for visualizations and analysis
Further analysis:
How is the median income distributed year by year in South Dakota?
How does the household income develop year by year in the different states? Are there high variances? Is there a relationship to the death rates?
The following models could be interesting, because the models we have chosen were not ideal:
Bayesian Regression
Decision Tree Regression, mainly xgboost
Gradient Descent Regression
In order to find the best performing model, they could be compared using a specific metric (e.g. R², Mean Squared Error or Mean Absolute Error).
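Such a comparison could be sketched as a simple loop over candidate models (hypothetical candidates on toy data here; the project's actual fitted models and test split would be substituted in):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(300, 1))
y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=300)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# fit each candidate and report the same three metrics side by side
models = {"linear": LinearRegression(), "lasso": Lasso(alpha=0.1)}
for name, model in models.items():
    y_pred = model.fit(X_train, y_train).predict(X_test)
    print(name,
          round(r2_score(y_test, y_pred), 3),
          round(mean_squared_error(y_test, y_pred), 3),
          round(mean_absolute_error(y_test, y_pred), 3))
```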
A model ensemble might also perform even better, since it combines the strengths of different models.
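One option for such an ensemble is scikit-learn's VotingRegressor, which averages the predictions of several base models (sketched on toy data, not the project data):

```python
import numpy as np
from sklearn.ensemble import VotingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = 2.0 * X[:, 0] + rng.normal(scale=0.3, size=200)

# average the predictions of a linear model and a random forest
ensemble = VotingRegressor([
    ("lin", LinearRegression()),
    ("rf", RandomForestRegressor(n_estimators=50, random_state=0)),
])
ensemble.fit(X, y)
print(ensemble.predict(X[:3]).shape)  # (3,)
```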
It could also be interesting to fit models at the level of individual years and compare the metrics with each other, in order to make statements about developments over the years.
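Per-year fits could be sketched with a groupby loop, assuming a frame with the project's column names (the values below are made up for illustration):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({
    "year": [2000, 2000, 2000, 2001, 2001, 2001],
    "median_household_income": [40_000, 50_000, 60_000, 42_000, 52_000, 62_000],
    "age_adjusted_death_rate": [900.0, 850.0, 800.0, 890.0, 845.0, 795.0],
})

# fit one regression per year and collect the in-sample R² score
scores = {}
for year, group in df.groupby("year"):
    X = group[["median_household_income"]]
    y = group["age_adjusted_death_rate"]
    scores[year] = LinearRegression().fit(X, y).score(X, y)
print(scores)
```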
Another option is to gather more data and perform feature engineering to improve overall performance.